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We deliver a call to arms for probabilistic numerical 
methods: algorithms for numerical tasks, including 
linear algebra, integration, optimization and solving 
differential equations, that return uncertainties in 
their calculations. Such uncertainties, arising from the 
loss of precision induced by numerical calculation 
with limited time or hardware, are important for 
much contemporary science and industry Within 
applications such as climate science and astrophysics, 
the need to make decisions on the basis of 
computations with large and complex data has led 
to a renewed focus on the management of numerical 
uncertainty We describe how several seminal classic 
numerical methods can be interpreted naturally 
as probabilistic inference. We then show that the 
probabilistic view suggests new algorithms that can 
flexibly be adapted to suit application specifics, 
while delivering improved empirical performance. 
We provide concrete illustrations of the benefits of 
probabilistic numeric algorithms on real scientific 
problems from astrometry and astronomical imaging, 
while highlighting open problems with these new 
algorithms. Finally, we describe how probabilistic 
numerical methods provide a coherent framework for 
identifying the uncertainty in calculations performed 
with a combination of numerical algorithms (e.g. both 
numerical optimisers and differential equation solvers), 
potentially allowing the diagnosis (and control) of error 
sources in computations. 
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Probability theory is the quantitative framework for scientific inference [21]. It codifies how 
observations (data) combine with modelling assumptions (prior distributions and likelihood 
functions) to give evidence for or against a hypothesis and values of unknown quantities. There 
is continuing debate about how prior assumptions can be chosen and validated [e.g. 19, §1.2.3]; 
but the role of probability as the language of uncertainty is rarely questioned. That is, as long as 
the subject of inference is a physical variable. What if the quantity in question is a mathematical 
statement, the solution to a computational task? Does it make sense to assign a probability measure 
p{x) over the solution of a linear system of equations Ax = bii A and b are known? If so, what is 
the meaning ofp(x), and can it be identified with a notion of 'uncertainty'? If one sees the use of 
probability in statistics as a way to remove "noise" from "signal", it seems misguided to apply it to 
a deterministic mathematical problem. But noise and stochasticity are themselves difficult to define 
precisely. Probability theory does not rest on the notion of randomness {aleatory uncertainty), but 
extends to quantifying epistemic uncertainty, arising solely from missing information^. Connections 
between deterministic computations and probabilities have a long history. Erdos and Kac [5] 
showed that the number of distinct prime factors in an integer follows a normal distribution. 
Their statement is precise, and useful for the analysis of factorization algorithms [24], even 
though it is difficult to "sample" from the integers. It is meaningful without appealing to the 
concept of epistemic uncertainty. Probabilistic and deterministic methods for inference on physical 
quantities have shared dualities from very early on: Legendre introduced the method of least- 
squares in 1805 as a deterministic best fit for data without a probabilistic interpretation. Gauss' 1809 
probabilistic formulation of the exact same method added a generative stochastic model for how 
the data might be assumed to have arisen. Legendre's least-squares is a useful method without the 
generative interpretation, but the Gaussian formulation adds the important notion of uncertainty 
(also interpretable as model capacity) that would later become crucial in areas like the study of 
dynamical systems.^ 

Several authors [4, 35, 42] have shown that the language of probabilistic inference can be 
applied to numerical problems, using a notion of uncertainty about the result of an intractable 
or incomplete computation, and giving rise to methods we will here call probabilistic numerics^. 
In such methods, uncertainty regularly arises solely from the lack of information inherent in the 
solution of an "intractable" problem: A quadrature method, for example, has access only to finitely 
many function values of the integrand; an exact answer would, in principle, require infinitely many 
such numbers. Algorithms for problems like integration and optimization proceed iteratively, each 
iteration providing information improving a running estimate for the correct answer. Probabilistic 
numerics provides methods that, in place of such estimates, update probability measures over 
the space of possible solutions. As Diaconis noted [4], it appears that Poincare proposed such an 
approach already in the 19th century. The recent explosive growth of automated inference, and the 
increasing importance of numerics for science, has given this idea new urgency. 

This article connects recent results, promising applications, and central questions for 
probabilistic numerics. We collate results showing that a number of basic, popular numerical 
methods can be identified with families of probabilistic inference procedures. The probability 
measures arising from this new interpretation of established methods can offer improved 

^Many resources discussing epistemic uncertainty can be found, at the time of writing, at http:// 
understandinguncertainty. org (the authors are not affiliated with this web page) 

^In fact, the very same connection between least-squares estimation and Gaussian inference has been re-discovered repeatedly, 
simply because least-squares estimation has been re-discovered repeatedly after Legendre, under names like ridge regression 
[18], Kriging [25], Tikhonov's method [44], and so on. The fundamental connection is that the normal distribution is the 
exponential of the square £2 norm. Because the exponential is a monotonic function, minimizing an .^ 2 -regularized £2 loss 
is equivalent to maximizing the product of Gaussian prior and likelihood. In this sense, this paper is adding numerical 
mathematicians to the list as yet another group of re-discoverers of Gaussian inference. Ironically, this list includes Gauss 
himself, because Gaussian elimination, introduced in the very same paper as the Gaussian distribution itself [7], can be 
interpreted as a conjugate-direction method [17], and thus as Gaussian regression [14]. See also [41]. 

^The probabilistic numerics community web site can be found at http : //www .probabilistic-numerics . org. 
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performance, enticing new functionality, and conceptual clarity; we demonstrate this with examples 
drawn from astrometry and computational photography. The article closes by pointing out the 
propagation of uncertainty through computational pipelines as a guiding goal for probabilistic 
numerics. 

It may be helpful to separate the issues discussed here clearly from other areas of overlap 
between statistics and numerical mathematics: We here focus on well-posed deterministic problems, 
identifying degrees of uncertainty arising from the computation itself. This is in contrast to the 
notion of uncertainty quantification [e.g. 22, 23], which identifies degrees of freedom in ill-posed 
problems, and where epistemic uncertainty arises from the set-up of the computation, rather than 
the computation itself. It also differs from several concepts of stochastic numerical methods, which 
use (aleatory) random numbers either to quantify uncertainty from repeated computations [e.g. 
20], or to reduce computational cost through randomly chosen projections [see for example 11, 28]. 

It is also important to note that, as a matter of course, existing frameworks already analyse and 
estimate the error created by a numerical algorithm. Theoretical analysis of computational errors 
generally yields convergence rates—^bounds up to an unknown constant—made under certain 
structural assumptions. Where probabilistic numerical methods are derived from classic ones, as 
described below, they naturally inherit such analytical properties. At runtime, the error is also 
estimated for the specific problem instance. Such run-time error estimation is frequently performed 
by monitoring the dynamics of the algorithm's main estimate [see 10, §11.4, for ODEs], [3, §4.5, for 
quadrature]. Similarly, in optimization problems, the magnitude of the gradient is often used to 
monitor the algorithm's progress. Such error estimates are informal, as are the solution estimates 
computed by the numerical method itself. They are meant to be used locally, mostly a criteria for 
the termination of a method, and the adaptation of its internal parameters. They can not typically 
be interpreted as a property (e.g. the variance) of a posterior probability measure, and thus can 
not be communicated to other algorithms, and thus can not be embedded in a larger framework 
of error estimation. They also do not usually inform the design of the numerical algorithm itself; 
instead, they are a diagnostic tool added post-hoc. Below, we argue that the estimation of errors 
should be given a formal framework, and that probability theory is uniquely suited for this task. 
Describing numerical computations as inference on a latent quantity yields a joint, consistent, 
framework for the construction of solution and error estimates. The inference perspective can 
provide a natural intuition that may suggest extensions and improvements. And the probabilistic 
framework provides a lingua franca for numerical computations, which allows the communication 
of uncertainty between methods in a chain of computation. 


2. Probabilistic numerical methods from classical ones 

Numerical algorithms estimate quantities not directly computable, using the results of more readily 
available computations. Even existing numerical methods can thus be seen as inference rules, 
reasoning about latent quantities from "observables", or "data". At face value, this connection 
between inference and computation is vague. But several recent results have shown that it can 
be made rigorous, such that established deterministic rules for various numerical problems can 
be understood as maximum a-posteriori estimates under specific priors (hypothesis classes) and 
likelihoods (observation models). The recurring picture is that there is a one-to-many relationship 
between a classic numerical method for a specific task and a family of probabilistic priors which 
give the same maximum posterior estimate but differing measures of uncertainty. Choosing one 
member of this family amounts to fitting an uncertainty, a task we call uncertainty calibration. 
The result of this process is a numerical method that returns a point estimate surrounded by a 
probability measure of uncertainty, such that the point estimate inherits the proven theoretical 
properties of the classic method, and the uncertainty offers new functionality. 
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(a) Quadrature 

We term the Probabilistic Numeric approach to quadrature Bayesian quadrature. Diaconis [4] may 
have been first to point out a clear connection between a Gaussian process regression model and 
a deterministic quadrature rule, an observation subsequently generalised by Wahba [47, §8] and 
O'Hagan [34], and also noted by Rasmussen & Ghahramani [39]. Details can be found in these 
works; here we construct an intuitive example highlighting the practical challenges of assigning 
uncertainty to the result of a computation. For concreteness, consider f(x) = exp [- sin^ (3a:) - a:^] 
(black in Figure 1, top). Evidently, / has a compact symbolic form and f(x) can be computed 
for virtually any a: c R in nanoseconds. It is a wholly deterministic object. Nevertheless, the real 
number 

F = f_J{x)dx (2.1) 

has no simple analytic value, in the sense that it can not be natively evaluated in low-level code. 
Quadrature rules offer "black box" estimates of F. These rules have been optimized so heavily [e.g. 
3] that they could almost be called "low-level", but their results do not come with the strict error 
bounds of floating-point operations; instead, assumptions about / are necessary to bound error. 
Perhaps the simplest quadrature rule is the trapezoid rule, which amounts to linear interpolation 
of / (red line in Figure 1, top left): Evaluate f(xi) on a grid -3 = a:i < X 2 < • • • < xj^ = 3 of N points, 
and compute 

^ 1 

-^midpoint" XI o [/(^O 

i=2 ^ 



(b) Bayes-Hermite Quadrature 

A probabilistic description of F requires a joint probability distribution over / and F. Since / 
lies in an infinite-dimensional Banach or Hilbert space, no Lebesgue measure can be defined. But 
Gaussian process measures can be well-defined over such spaces, and offer a powerful framework 
for quadrature [38,47]. In particular, we may choose to model / on [-3,3] by p(f) = GV{f; 0, k), 
a Gaussian process with vanishing mean /x = 0 and the linear spline covariance function [29] (a 
stationary variant of the Wiener process) 

k(x,x') = c(l + b - ^/3b\x - x'\), for some c, 5> 0. (2.3) 


More precisely, this assigns a probability measure over the function values f(x) on [-3,3], such 
that any restriction to finitely many evaluations y = [/(a:i),..., /(x^vf)] at A = [xi^ ..., Xjv] is 
jointly Gaussian distributed with zero mean and covariance cov[/(xQ, f{xj)\ = k{xi^Xj). Samples 
drawn from this process are shown in grey in the top left of Figure 1. These sample paths represent 
the hypothesis class associated with this Gaussian process prior: They are continuous, but not 
differentiable, in mean square expectation [1, § 2.2]. Because Gaussian processes are closed under 
linear projections [e.g. 1, p. 27], this distribution over / is identified with a corresponding univariate 
Gaussian distribution [29], 


p{F)=M 


rr 3 


(2.4) 


on F. The strength of this formulation is that it provides a clear framework under which 
observations f(xi ) can be incorporated. The measure on p(F) conditioned on the collected function 
values is another scalar Gaussian distribution 


p{F\y)=J\f\F- j\{x,X)K~^ydx, Jj\{x,x)-k{x,X)K~^k{X,x)dxdx 


(2.5) 


where k(x,X) = [k(x, xi),..., k(x, xjsf)] and K is the symmetric positive definite matrix with 
elements Kij = k(xi,Xj). The final expression in the brackets, the posterior covariance can be 
optimized with respect to X to design a "most informative" dataset. For the spline kernel (2.3), this 
leads to placing A on a regular, equidistant, grid [29]. 
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Figure 1 . Quadrature rules, illustrating the challenge of uncertainty-calibration. Left two columns: Top row: Function f(x) 
(black line), is approximately integrated using two different Gaussian process priors (left: linear spline; right: exponentiated 
quadratic), giving posterior distributions and mean estimates. Gray lines are functions sampled from the prior. The thick 
coloured line is the posterior mean, thin lines are posterior samples, and the delineation of two marginal standard 
deviations. The shading represents posterior probability density. Bottom row: As the number of evaluation points increases, 
the posterior mean (thick line with points) converges to the true integral value; note the more rapid convergence of the 
exponentiated quadratic prior. The posterior covariance provides an error estimate whose scale is defined by the posterior 
mean alone (each thin coloured line in the plots corresponds to a different instance of such an estimate). But it is only a 
meaningful error estimate if it is matched well to the function’s actual properties. The left plot shows systematic difference 
between the convergence of the real error and the convergence of the estimated error under the linear spline, whereas 
convergence of the estimated error under the exponentiated quadratic prior is better calibrated to the real error. Grey 
grid lines in the background, bottom left, correspond to 0(N~^) convergence of the error in the number N of function 
evaluations. Right two columns: The same experiment repeated with a function / drawn from the spline kernel prior. For 
this function, the trapezoid rule is the optimal statistical estimator of the integral (note well-calibrated error measure in 
bottom row, third panel), while the Gaussian kernel GP is strongly over-confident. 


The spline kernel k of (2.3) is a piecewise linear function with a single point of non¬ 
differentiability at X = x'. The posterior mean k(x,X)K~^y is a weighted sum of N such kernels 
centred on the evaluation points X and constrained by the likelihood to pass through the values 
y at the nodes X. Thus, the posterior mean is a linear spline interpolant of the evaluations, 
and the posterior mean of Equation (2.5) is exactly equal to the trapezoid-rule estimate from 
X [see also 4]. That is, Bayesian quadrature with a linear spline prior provides a probabilistic 
interpretation of the trapezoidal rule; it supplements the estimate with a full probability distribution 
characterising uncertainty. The corresponding conditional on / is a Gaussian process whose 
mean is the piecewise linear red function in the top left of Figure 1 (the Figure also represents 
the conditional distribution p(/ | /(xi),..., /(xjv)) as a red cloud, and some samples). Various 
other, more elaborate quadrature rules (e.g. higher order spline interpolation and Chebyshev 
polynomials) can be cast probabilistically in analogous ways, simply by changing the covariance 
function k [4, 34, 47]. 
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(c) Added value, and challenge, of probabilistic output 

A non-probabilistic analysis of the trapezoid-rule is that, for sufficiently regular functions /, this 
rule converges at a rate of at least 0{N~^) to the correct value for F as N increases [e.g. 3, Eq. 2.1.6] 
(other quadrature rules, like cubic splines, have better convergence for differentiable functions). 
Figure 1 bottom left shows the mean estimate (equivalently, the trapezoid-rule) as a thick line. Grey 
help lines represent convergence. Clearly, the asymptotic behaviour is steeper than those 

linesSo the theoretical bound is correct, but it is also of little practical value: No linear bound is 
tight from start to convergence. 

On the probabilistic side, the standard deviation of the conditional (2.5) is regularly interpreted 
as an estimate of the imprecision of the mean estimate. In fact, this error estimate (thin red lines 
in Figure 1, top left) is analogous to the deterministic linear convergence bound for continuous 
functions. There is a family of such error bounds associated with the same posterior mean, with 
each line corresponding to a different value for the unknown constant in the bound and different 
values for the parameters b and c in the covariance.^ It may seem as though the probabilistic 
interpretation had added nothing new, but since this view identifies quadrature rules of varying 
assumptions as parameter choices in Gaussian process regression, it embeds seemingly separate 
rules in a hierarchical space of models, from which models with good error modelling can be 
selected by hierarchical probabilistic inference. This can be done without collecting additional 
data-points [38, §5]. 

More generally, the probabilistic numeric viewpoint provides a principled way to manage the 
parameters of numerical procedures. Where Markov Chain Monte Carlo procedures might require 
the hand-tuning of parameters such as step sizes and annealing schedules, Bayesian quadrature 
allows the machinery of statistical inference procedure to be brought to bear upon such parameters. 
A practical example of the benefits of approximate Bayesian inference for the (hyper-)parameters 
of a Bayesian quadrature procedure is given by [36]. 

The function / used in this example is much smoother than typical functions under the Gaussian 
process prior distribution associated with the trapezoid rule (shown as thin grey samples in the 
top left plot). The second column (using orange) of Figure 1 shows analogous experiments with 
the exponentiated quadratic covariance function k(x, x) = 6^ exp(-(a: - x)^l\^), corresponding 
to a very strong smoothness assumption on / [45] (see grey samples in top right plot), giving a 
very quickly converging estimate. In this case, the ''error bars" provided by the standard deviation 
converge in a qualitatively comparable way. 

Of course, the faster convergence of this quadrature rule based on the exponentiated quadratic- 
covariance prior is not a universal property. It is the effect of a much stronger set of prior 
assumptions. If the true integrand is rougher than expected under this prior, the quadrature 
estimate arising from this prior can be quite wrong. The right two columns of Figure 1 show 
analogous experiments on an integrand that is a true sample from a Gaussian process with 
the spline covariance (2.3). In this case, the spline prior is the optimal statistical estimator by 
construction, and its error estimate is perfectly calibrated (third column of Figure 1), while the 
exponentiated quadratic-kernel gives over-confident, and inefficient estimates (right-most column). 

Identifying the optimal regression model from a larger class, just based on the collected function 
values, requires more computational work than to fix a regressor from the start. But it also gives 
better calibrated uncertainty. Gontemporary general-purpose quadrature implementations [e.g. 
3] remain lightweight by recursively re-using previous computations. The above experiments 
show that it is possible to design Bayesian quadrature rules with well-calibrated posterior error 
estimates, but it remains a question how small the computational overhead from a probabilistic 
computation over these methods can be made. Even so, formulating quadrature as probabilistic 
regression precisely captures a trade-off between prior assumptions inherent in a computation and 

^This, too, is a well-known result: If the integrand is differentiable, rather than just continuous, the trapezoid rule has 
quadratic convergence rate [3, Eq. 2.1.12]. 

^ The initial behaviour of the red lines in the figure is a function of the scale b, which relates to assumptions about the rate at 
which the asymptotic quadratic convergence is approached. 
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the computational effort required in that computation to achieve a certain precision. Computational 
rules arising from a strongly constrained hypothesis class can perform much better than less 
restrictive rules if the prior assumptions are valid. In the numerical setting—in contrast to many 
empirical situations in statistics—it is often possible to precisely check whether a particular prior 
assumption is valid: The machine performing the computation has access, at least in principle, 
to a formal, complete, description of its task, in form of the source code describing the task (the 
integrand, the optimization objective, etc.). Using this source code, it is for example to test, at 
runtime, whether, and how many times, an integrand is continuously differentiable [e.g. 8]. Using 
this information will in future allow the design of improved quadrature methods. Once one knows 
that general purpose quadrature methods effectively use a Gaussian process prior over function 
values, it is natural to ask whether this prior actually incorporates the salient information available 
for one's specific problem. Including such information in the prior leads to customized, "tailored", 
numerical methods that can perform better. 

Probabilistic inference also furnishes the framework required to tackle numerics tasks using 
decision theory. For quadrature, the decision problem to be solved is that of node selection: that is, 
the determination of the optimal positions for points (or nodes) at which to evaluate the integrand. 
With the definition of an appropriate loss function, such as the posterior variance of the integral, 
such nodes can be optimally selected by minimising the expected loss function. It seems clear 
that this approach can improve upon the simple uniform grids of many traditional quadrature 
methods, and can enable active learning: where surprising evaluations can influence the selection 
of future nodes. 

This decision-theoretic approach stands in contrast with that adopted by randomised, Monte 
Carlo, approaches to quadrature. Such approaches generate nodes with the use of a pseudo-random 
number generator, injecting additional epistemic uncertainty (about the value of the generator's 
outputs) into a procedure designed to reduce the uncertainty in an integral. It is worth noting 
that the use of pseudo-random generators burdens the procedure with additional computational 
overhead: pseudo-random numbers are cheap, but not free. The principal feature of Monte Carlo 
approaches is their conservative nature: the Monte Carlo policy will always, eventually, take an 
additional node arbitrarily close to an existing node. The disadvantage of this strategy is its waste 
of valuable evaluations: the convergence rate of Monte Carlo techniques, is clearly 

improved upon by both traditional quadrature and Bayesian quadrature methods. This problem 
is only worsened by the common discarding of evaluations known as 'burn-in' and 'thinning'. 
The advantage of Monte Carlo, of course, is its robustness to even highly non-smooth integrands. 
However, Bayesian quadrature can realise more value from evaluations by exploiting known 
structure (e.g. smoothness) in the integrand. 



(d) Empirical evaluation of Bayesian quadrature for astrometry 

We illustrate this on an integration problem drawn from astrometry, the measurement of the 
motion of stars. In order to validate astrometric analysis, we aim to recover the number of planets 
present in synthetic data generated (with a known number of planets) to mimic that produced 
by an astrometric facility such as the GAIA satellite [43].^ Here, quadrature's task is to compute 
the model evidence (marginal likelihood) Z = f p{V \0,Ai) p(0 | A4) d^ of a model M for orbital 
motions, where T> are the gathered observations and 0 are the model parameters. Specifically, it 
is of interest to compare the evidence for models including differing numbers of exoplanets. For 
the following example, to provide a focal problem upon which to compare quadrature methods, 
we compute the evidence of a model with two such planets on data generated with a two-planet 
model. The corresponding integral is analytically intractable, with a multi-modal integrand (the 
likelihood p(V\0,M)) and a 19-dimensional 0 rendering the quadrature problem challenging. 

As the probabilistic method, we use a recent algorithm: Warped, Sequential, Active Bayesian 
Integration (WSABI^) [9]. WSABI is a Bayesian quadrature algorithm that uses an internal 

^The authors are grateful to H. Parviainen and S. Aigrain for providing data and code examples. 

^Matlab code for WSABI is available at https : / /github. com/Oxf ordML/wsabi. 
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Figure 2. We compare different quadrature methods in computing the 19-dimensional integral giving model evidence 
in an astrometry application. The Bayesian quadrature algorithm (WSABI) that employs active selection of nodes, along 
with prior knowledge of the smoothness and non-negativity of the integrand, converges faster than the Monte Carlo 
approaches: simple Monte Carlo (SMC) and Annealed Importance Sampling (AIS). Note that Bayesian Monte Carlo 
(BMC) is a Bayesian quadrature algorithm that uses the same samples as SMC, explaining their similar performance. 
Note also the performance improvement offered by WSABI over BMC, suggesting the crucial role played by the active 
selection of nodes. 


probabilistic model that is well-calibrated to the suspected properties of the problem's integrand. 
Firstly, like the exponentiated quadratic, WSABFs covariance function is suitable for smooth 
integrands, as are expected for the problem. Secondly, and beyond what is achievable with classic 
quadrature rules, WSABI explicitly encodes the fact that the integrand (the likelihood 'p{V\0,M)) 
is strictly positive. WSABI also makes use of a further opportunity afforded by the probabilistic 
numeric approach: it actively select nodes so as to minimise the uncertainty in the integral. This 
final contribution permits nodes to be selected that are far more informative than gridded or 
randomly selected evaluations. We compare this algorithm against two different Monte Carlo 
approaches to the problem: Annealed Importance Sampling (AIS) [31] (which was implemented 
with a Metropolis-Hastings sampler) and simple Monte Carlo (SMC). We additionally compared 
against Bayesian Monte Carlo (BMC) [39], a Bayesian quadrature algorithm using the simpler 
exponentiated quadratic model, and whose nodes were taken from the same samples selected by 
SMC. "Ground truth" (Ztme) was obtained through exhaustive SMC sampling (10^ samples). The 
results in Figure 2 show that the probabilistic quadrature method achieves improved precision 
drastically faster than the Monte Carlo estimates. It is important to point out that the plot's abscissa 
is "wall-clock" time, not algorithmic steps. Probabilistic algorithms need not be expensive. 


(e) Linear Algebra 

Computational linear algebra covers various operations on matrices. We will here focus on linear 
optimization, where b € is known and the task is to find x e R^ such that Ax = b where 
A G R^^^ is symmetric positive definite. We assume access to projections As for arbitrary s g R^. 
If N is too large for exact inversion of A, a widely known approach is the method of conjugate 
gradients (CG) [17], which produces a convergent sequence xq, ..., xm of improving estimates for 
X. Each iteration involves one matrix-vector multiplication and a small number of linear operations, 
to produce the update Si = Xi- and an "observation" = Asi. CG's good performance has 
been analysed extensively [e.g. 32, §5.1]. 

We will use the shorthands Sm = [^i? • • •, ^m] ^od = [Vi , • • •, Vm] ^or the set of projection 
directions and "observations" after M iterations of the method. Defining a probabilistic numerical 
algorithm requires a joint probability measure p{Ym,A, b, x \ Sm) over all involved variables, 
conditioned on the algorithm's active design decisions; and an "action rule" (design, policy, control 
law) describing how the algorithm should collect data. Recent work by Hennig [14] describes 
such a model which exactly reproduces the sequence of CG: As prior, choose a 
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Gaussian measure over the (assumed to exist) inverse H = A ^ of A 


p(H} =M(H\Ho,r{W ® w)r''). 
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( 2 . 6 ) 


This implies a Gaussian prior over x = Hb, and a non-Gaussian prior over A. In (2.6), J\f is a 
Gaussian distribution over the elements of H stacked into a vector H. F is the symmetrization 

operator (FA = y2{A + A^)), and 0 is the Kronecker product {F{W 0 W)F^ is also known as 
the symmetric Kronecker product [46]). As projections are presumed noise-free, the likelihood is 
the Dirac distribution, the limit of a Gaussian with vanishing width, which can also be seen as 
performing a strict conditioning of the prior on the observed function values. 


piXM I A, b, a:, ^m) = n “ yi) = n “ ^Vi)- (2.7) 

The action rule at iteration i is to move to cc^+i = - aHi{Axi - b), where Hi is the posterior 

mean p{H \ Im)- The optimal step a can be computed exactly in a linear computation. 

Eq. (2.6) requires choices for Hq and W. The prior mean is set to unit, Hq = I.liW e 
is positive definite, then p{H) assigns finite measure to every symmetric H eR^^^, and the 
algorithm converges to the true x in at most N steps, assuming exact computations [14]. In this 
sense it is non-restrictive, but of course some matrices are assigned more density than others. If W 
is set to a value among the set {W e R^^^ \ W symm. positive definite and WYm = cfSm^cf € R+} 
(this includes the true matrix W = H, but does not require access to it), then the resulting algorithm 
exactly reproduces the iteration sequence {^Ci}i=o,...,Ar of the CG method, and can be implemented 
in the exact same way. However, this derivation also provides something new: A Gaussian posterior 
distribution p{H \ y) Hm, ^{Wm ® over H. Details can be found in [14], for the 

present discussion it suffices to know that both the posterior mean Hm and covariance parameter 
Wm are of manageable form. In particular, Hm = I + UEU^ with a diagonal matrix E e ]^2 Mx 2M 
and '"skinny" matrices U c which are a function of the steps s and observed projections y. 

So, as in the case of quadrature, there is a family of Gaussian priors of varying width (scaled by 
cr), such that all members of the family give the same posterior mean estimate. And this posterior 
mean estimate is identical to a classic numerical method (CG). But each member of the family 
gives a different posterior covariance—a different uncertainty estimate. 

It is an interesting question to which degree the uncertainty parameter Wm can be designed to 
give a meaningful error estimate. Some answers can be found in [14]. Interestingly, the equivalence- 
class of prior covariances Wq that match CG in the mean estimate has more degrees of freedom 
than the number M of observations (matrix-vector multiplications) collected by the algorithm 
during its typical runtime. Fitting the posterior uncertainty Wm thus requires strong regularization. 
The method advocated in [14] constructs such a regularized estimator exclusively from scalar 
numbers already collected during the run of the CG method, thus keeping computational overhead 
very small. 

But, as in quadrature, there are valuable applications for the probabilistic formulation that do 
not strictly require a well-calibrated width of the posterior. Applications that make primary use 
of the posterior mean may just require algebraic structure in the prior, up to an arbitrary scaling 
constant, to incorporate available, helpful, information. We will not do this here and instead 
highlight another use of probabilities over point estimates: the propagation of knowledge from 
one linear problem to another related problem. 

The approach described in the following is known in the numerical linear algebra community 
as the recycling of Krylov sequences [37]. However, while the framework of classic numerical analysis 
required challenging and bespoke derivation of this result, it follows naturally in the probabilistic 
viewpoint as the extension of a parametric regressor to a filter on a time-varying process. The 
probabilistic formulation of computation uses the universal and unique language of inference to 
enable the solution of similar problems across the breadth of numerics using similar techniques. 
This is in contrast to the compartmentalised state of current numerics, which demands distinct 
expert knowledge of each individual numeric problem in order to make progress towards it. 
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Further, there is social value in making results accessible to any practitioner with a graduate 
knowledge of statistics. 

In many set-ups, the same A features in a sequence of problems Axt = bt,t = i, _In others, 

a map At changes slightly from step t to step t + 1. Figure 3 describes a blind deconvolution 
problem from astronomical imaging:^ The task is to remove an unknown linear blur from a 
sequence pi ,of astronomical images [12]. Atmospherical disturbances create a blur that 
continuously varies over time. The model is that each frame is the noisy result of convolution of 
the same ground truth image x with a spatially varying blur kernel /^, i.e. yt = ft x + n^, where 
rii is white Gaussian noise. A matrix X encodes the convolution operation as yt = X + rii (it 
is possible to ensure X is positive definite). A blind deconvolution algorithm iterates between 
estimating X and estimating the [12]. Each iteration thus requires the solution of K linear 
problems (with fixed X) to find the /^, then one larger linear problem to find x. The naive 
approach of running separate instances of CG wastes information, because the K linear problems 
all share the matrix X, and from one iteration to the next, the matrix X changes less and less 
as the iterations approach convergence. Instead, information can be propagated between related 
computations using the probabilistic interpretation of CG, by starting each computation in the 
sequence with a prior mean Hq set to the posterior mean Hm of the preceding problem. Due to the 
low-rank structure of Hm, this has low cost. To prevent a continued rise in computational costs as 
more and more linear problems are solved, Hm can be restricted to a fixed rank approximation 
after each inner loop, also at low cost. The plots in the lower half of Figure 3 show the increase in 
computational efficiency: For the baseline of solving 40 linear problems independently in sequence, 
each converges about equally fast (top plot, each '"jump" is the start of a new problem). The lower 
plot shows optimization progress of the same 40 problems when information is propagated from 
one problem to the next. The first problem amounts to standard CG, while subsequent iterations 
can make increasingly better use of the available information. The Figure also shows the dominant 
eigenvectors of the inferred posterior means of X~^ after K = 40 subsequent linear problems, which 
converge to a relatively generic basis for point-spread functions. Although not strictly correct, this 
scheme can be intuitively understood as inferring a pre-conditioner across the sequence of problems, 
by Bayesian filtering (the technical caveat is that the described scheme does not re-scale the linear 
problem itself, as a pre-conditioner would, it just shifts the initial solution estimate). 


(i) Further Areas 

These examples highlight the areas of quadrature and linear algebra. Analogous results, identifying 
existing numerical methods with maximum a-posteriori estimators have been established in other 
areas, too. We do not experiment with them here, but they help complete the picture of numerical 
methods as inference across problem boundaries: 

In non-linear optimization, quasi-Newton methods like the BFGS rule are deeply connected 
to conjugate gradients (in linear problems, BFGS and CG are the same algorithm [30]). The BFGS 
algorithm can be interpreted as a specific kind of autoregressive generalization of the Gaussian 
model for conjugate gradients [16]. Among other things, this allows explicit modelling of noise on 
evaluated gradients, a pressing issue in large-scale machine learning [13]. 

One currently particularly exciting area for probabilistic numerics is ordinary differential 
equations, more specifically the solution of initial value problems (IVPs) of the form c^®/dt(t) = 
f{x{t),t), where x: R+ ^ is a real-valued curve parametrised by t known to start at the initial 
value x(to) = xo. Explicit Runge-Kutta methods are a basic and well-studied tool for such problems 
[10] —while not necessarily state of the art, they nevertheless perform well on many problems, 
and are conceptually very clear. From the inference perspective, Runge-Kutta methods are linear 
extrapolation rules. At increasing nodes to < < * * * < < G they repeatedly construct ''estimates" 

x(ti) x(ti) for the true solution which is used to collect an "observation" y^ = f{x(ti),ti), such 


The authors thank S. Harmeling, M. Hirsch, S. Sra and B. Scholkopf [12] for providing data. 
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observed images 



eigen-vectors of inferred approximation to X ^: 




Figure 3. Solving sequences of linear problems by utilizing the probabilistic interpretation of conjugate gradients, also 
known as the recycling of Krylov sequences. Top: A sequence of observed astronomical images is modelled as 
the convolution of a stationary true image x with a time-varying point-spread function f- (the matrix X encodes the 
convolution with x). Each individual deconvolution task requires one run of a linear solver, here chosen to be the method 
of conjugate gradients. Bottom plots: If each problem is solved independently, each instance of the solver progresses 
similarly (top plot. Optimization residuals over time. Each “jump” in the residual is the start of a new frame/deconvolution 
problem). If the posterior mean implied by the probabilistic interpretation of the solver is communicated from one problem 
to the next, the solvers progress increasingly faster (bottom plot, note decreasing number of steps for each deconvolution 
problem, and decrease, by about one order of magnitude, of initial residuals). Middle row: Over time, the vectors spanning 
this posterior mean for X~^ converge to a generic basis for point-spread functions. 


that the estimate is a linear combination of previous observations: 

x{ti) = xo + ( 2 . 8 ) 

j<i 

Crucially, the weights wij are chosen such that, after s evaluations, the estimate has high 
convergence order: \\x{ts) - x{ts)\\ = with an order p<s (typically, p< 5). 

It is important to point out the central role that linear extrapolation, and linear computations 
more generally, play again here, as they did in the other numerical settings discussed above. It is 
not an oversimplification to note that numerical methods often amount to efficiently projecting an 
intractable problem into a tractable linear computation. Since the Gaussian family is closed under 
all linear operations, it is, perhaps, no surprise that Gaussian distributions play a central role in 
probabilistic re-interpretations of existing numerical methods. In the case of initial value problems, 
Gaussian process extrapolation was previously suggested by Skilling [42] as a tool for their solution. 
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Starting from scratch. Skilling arrived at a method that shares the linear structure of Eq. (2.8), 
but does not have the strong theoretical underpinning of Runge-Kutta methods, in particular. 
Skilling's method does not share the high convergence order of Runge-Kutta methods. But the 
probabilistic formulation allows for novel theoretical analysis of its own [2], and new kinds of 
applications, such as the marginalization over posterior uncertainty in subsequent computational 
steps and finite prior uncertainty over the initial value [15]. A considerably vaguer but much earlier 
observation was made, on the side of numerical mathematics, by Nordsieck [33], who noted that a 
class of methods he proposed, and which were subsequently captured in a wider nomenclature of 
ODE solvers, bore a resemblance to linear electrical filters. These, in turn, are closely connected to 
Gaussian process regression through the notion of Markov processes. 

Recently, Schober et al. [40] showed that these connections between Gaussian regression and 
the solution of IVP's, hitherto only conceptual, can be made tight. There is a family of Gauss- 
Markov priors that, used as extrapolation rules, give posterior Gaussian processes whose mean 
function exactly matches members of the Runge-Kutta family. (Thanks to its Markov property, 
the corresponding inference method can be implemented as a signal filter, and thus in linear 
computational complexity, like Runge-Kutta methods). Hence, as in the other areas of numerics, 
there is now a family of methods that returns the trusted point estimates of an established method, 
while giving a new posterior uncertainty estimate allowing new functionality. 


3. Discussion 

(a) General Recipe for Probabilistic Numerical Algorithms 

These recent results, identifying probabilistic formulations for classic numerical methods, highlight 
a general structure. Consider the problem of approximating the intractable variable z, if the 
algorithm has the ability to choose 'inputs' x = {Xi}i = l,... for computations that result in numbers 
y{x) = {yi(iCi)}i=i,.... A blueprint for the definition of probabilistic numerical methods requires 
two main ingredients: 

(i) A generative model p{z, y{x)) for all variables involved—that is, a joint probability measure 
over the intractable quantity to be computed, and the tractable numerical quantities 
computed in the process of the algorithm. Like all (sufficiently structured) probability 
measures, this joint measure can be written as 

p{z,yix))=piz)p{yix)\z), (3.1) 

i.e. separated into a prior p{z) and a likelihood p(y{x)\z). The prior encodes a hypothesis 
class over solutions, and assigns a typically non-uniform measure over this class. The 
likelihood explains how the collected tractable numbers y relate to z. It has the basic role 
of describing the numerical task. Often, in classic numerical problems, the likelihood is a 
deterministic conditioning rule, a point measure. 

(ii) A design, action rule, or policy r, such that 

Xi+i=r(^p{z,yix)),xi:i,yi,iy (3.2) 

encoding how the algorithm should collect numbers. (Here xi:i should be understood 
as the actions taken in the preceding steps 1 to i, and similarly for t/i;^). This rule can be 
simple, for example it could be independent of collected data (regular grids for integration). 
Or it might have a Markov-type property that the decision at i only depends on k<i 
previous decisions (for example in ODE solvers). Sometimes, these rules can be shown to 
be associated with the minimization of some empirical loss function, and thus be given a 
decision-theoretic motivation. This is for example the case for regular grids in quadrature 
rules [29]. 
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Problem class 

integration 

linear opt. 

nonlinear opt. 

ODE IVPs 

inferred z 

« = /;/ f{x)dx 

II 

I 

II 

z = B = VV'^f 

z'{t) = f{z{t),t) 

classic method 

Gaussian quad. 

conjugate gradients 

BFGS 

Runge-Kutta 

p{z) 

QVU\y,k) 


gv{z-,n,k) 

gv{z-,ij,,k) 

p{y\z) 

Kfi.Xi) = yi) 

l{yi = Axi) 

l{yi = Bxi) 

I{yi = z'it)) 

decision rule 

minimize 

gradient at 

gradient under 

evaluate at 


post, variance 

est. solution 

est. Hessian 

est. solution 


Table 1. Probabilistic description of several basic numerical problems (shortened notation for brevity). In quadrature, 
(symmetric positive definite) linear optimization, non-linear optimization, and the solution of ordinary differential equation 
initial value problems, classic methods can be cast as maximum a-posteriori estimation under Gaussian priors. In each 
case, the likelihood function is a strict conditioning, because observations are assumed to be noise-free. Because 
numerical methods are active (they decide which computations to perform), they require a decision rule. This is often 
“greedy”: evaluation under the posterior mean estimate. The exception is integration, which is the only area where the 
estimated solution of the numerical task is not required to construct the next evaluation. 


The aforementioned results show that classic, base-case algorithms for several fundamental 
numerical problems can be cast as maximum a-posteriori inference in specific cases of this 
description; typically under Gaussian priors, and often under simple action rules, like uniform 
gridding (in quadrature) or greedy extrapolation (in linear algebra, optimization and the solution 
of ODEs). Table 1 gives a short summary 

(b) Current Limitations 

Numerical methods have undergone centuries of development and analysis. The result is a mature 
set of algorithms that have been ingrained in the scientific tool-set. By contrast, the probabilistic 
viewpoint suggested here is an emerging area. Many questions remain unanswered, and many 
aspects of practical importance are missing: Formal analysis is at an early stage. Efficient and stable 
implementations are still in development. Convincing use-cases from various scientific disciplines 
are only beginning to emerge. We hope that the reader will take these issues as a motivation to 
contribute, rather than a hold-up. As the use of large scale computation, simulation, and the use of 
data permeate the quantitative sciences, there is clearly a need for a formal theory of uncertainty 
in computation. 


(c) New Paths for Research 

In our opinion, the match between probabilistic inference and existing numerical methods lays 
a firm foundation for the analysis of probabilistic numerical methods. We see two primary, 
complementary goals: 

First, implicit prior assumptions can now be questioned. This could be done in an ''aggressive" 
way, in the hope of finding either algorithms with faster convergence on a smaller set of problems 
satisfying stronger assumptions (as in the quadrature example of Section (b)). Conversely, 
a "conservative" re-definition of prior assumptions might improve robustness at increased 
computational cost. A particularly important aspect in this regard is the action rule r. Wherever r 
is a function of previously collected 'data' (known as adaptive design in statistics and active learning 
in machine learning), a bias can occur. Where the collected data also influence the result of future 
actions (as in ODE solvers), a more severe problem, an exploration-exploitation trade-off can arise. 
Checking for such biases, and potentially correcting them, can increase computational cost. But in 
some applications that require high robustness this effort can pay off. 

Secondly the modelling assumptions, in particular the likelihood, can be extended to increase 
the reach of existing methods to new settings. A first point of interest is the explicit modelling 
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of uncertainty or noise on the evaluations themselves. This generalization, which would be 
challenging to construct from a classical standpoint, is often straightforward once a probabilistic 
interpretation has been found. It may be as simple as replacing the point-mass likelihood functions 
in Table 1 with Gaussian distributions. A prominent case of this aspect is the optimization of noisy 
functions, such as it arises for example in the training of large-scale machine learning architectures 
from subsets of a large or infinite dataset. 

Other ideas, like the propagation of knowledge between problems, as in Figure 3, are just as 
difficult to motivate and study in a classic formulation, but suggest themselves quite naturally in 
the probabilistic formulation. 

There are also practical considerations that shape the research effort. Gaussian distributions 
play an important role, at least where the inferred quantity is continuous valued. This is not 
incidental: To a large degree, the point of a numerical method is to turn an intractable computation 
into a sequence of linear computations. The Gaussian exponential family is closed under linear 
projections, thus ideally suited for this task. 

Efficient adaptation of model hyper-parameters is crucial for a well-calibrated posterior measure. 
Models with fixed parameters often simply reproduce existing analytic bounds; only through 
parameter adaptation can uncertainty be actively "Titled". Doing so is perhaps more challenging 
than elsewhere in statistics because numerical methods are "inner-loop" algorithms used to solve 
more complex, higher-level computations. It is important to find computationally lightweight 
parameter estimation methods, perhaps at the cost of accepting some limitations in model flexibility. 

Although the fundamental insight that numerical methods solve inference problems is not 
new, the study of probabilistic numerical methods is still young. Recent work has made progress, 
exposing a wealth of enticing applications in the process. We conclude this text by highlighting a 
most promising, if distant, application motivating ongoing research. 

(d) A Vision: Chained numerical methods communicating uncertainty 

Fueled by ubiquitous collection and communication of data, several academic and industrial fields 
are now interested in systems that use observations to adapt to, and interact with, their data source 
in an autonomous way. Figure 4 shows a conceptual sketch of an autonomous machine aiming 
to solve a given task by using observations (data) D to build a probabilistic model p(xt \D,0t) oi 
variables xt that describe the environment's dynamics through model parameters Ot ■ The model 
can be used to predict future states as a function of actions at chosen by the machine. The goal 
is to choose actions that, over time, maximize some measure of utility that encodes the task. This 
requires a sequence of numerical steps: Inference on x requires marginalization and expectations, 
i.e. integration. Fitting 0 involves optimization. Prediction of xt+st entail solving differential 
equations. All three areas have linear base cases (inference in linear regression, optimization of 
quadratic functions, the solution of linear ODEs). The combination of a sequence of "black-box" 
numerical methods in such automated set-ups gives rise to new challenges. Each method receives 
a point estimate from its precursor, performs its local computation (and adds its local error), and 
hands the result on. Errors can accumulate in unexpected ways along this chain, but modelling their 
accumulation provides value: It may be unnecessary to run a numerical method to convergence if 
its inputs are already known to be only rough estimates. 

Specifically, numerical methods allowing for probabilistic inputs and outputs turn the sketch of 
Figure 4 into a factor graph [6], and allow propagation of uncertainty estimates along the chain of 
computation, through message passing [26]. This would identify sources of computational error, 
allowing: the active management of a computational budget across the chain; the dedication of finite 
computer resources to steps that dominate the overall error; and the truncation of computations 
early once they reach sufficient precision. Uncertainty propagation through computations has been 
studied widely before [27, gives a review], but the available algorithms focus on the effects of 
set-up uncertainties on the outcome of a computation, rather than the computation itself. This new 
functionality explicitly requires calibrated probabilistic uncertainty at each step of the computation, 
at runtime. Classic abstract convergence analyses can not be used for this kind of estimation. 
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Figure 4. Sketch of an autonomous system, collecting data to build a parametrised model of the environment. Predictions 
of future states from the model can be used to choose an action strategy. If intermediate operations are solved by 
numerical methods, both computational errors and inherent uncertainty should be propagated across the pipeline to 
monitor and target computational effort. 


4. Conclusion 

Numerical tasks can be interpreted as inference problems, giving rise to probabilistic numerical 
methods. Established algorithms for many tasks can be cast explicitly in this light. Doing so 
establishes connections between seemingly disparate problems, yields new functionality, and 
can improve performance on structured problems. To allow interpretation of the posterior as a 
statement of uncertainty, care must be taken to ensure well-calibrated priors and models. But 
even where the uncertainty interpretation is not (yet) rigorously established, the probabilistic 
formulation already allows for the encoding of prior information about problem structure, 
including the propagation of collected information among problem instances, leading to improved 
performance. Many open questions remain for this exciting field. In the long run, probabilistic 
formulations may allow the propagation of uncertainty through pipelines of computation, and 
thus the active control of computational effort through hierarchical, modular computations. 
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